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1.  Abstract 

In  the  Phase  I  work  presented  here,  we  developed  computer  programs  for  the  Con¬ 
nection  Machine  (a  CM-2)  to  compute  sparse  matrix-vector  multiplies,  sparse  triangular 
solves,  and  inner-products.  These  subproblems  constitute  the  iterative  portion  of  Krylov 
space  methods  that  use  incompletely  factored  preconditioning  matrices  to  solve  very  large 
sparse  linear  systems.  The  numerical  solution  of  such  systems  is  very  often  the  compute¬ 
intensive  kernel  of  the  kind  of  large  scale  engineering  and  scientific  computations  that  are 
commonly  done  on  supercomputers.  Moreover,  Krylov  space  methods  which  are  precon¬ 
ditioned  by  appropriate  incompletely  factored  matrices  are  generally  recognized  as  the 
algorithms  of  choice  in  terms  of  minimizing  the  number  of  floating  point  operations  re¬ 
quired  for  solving  such  problems. 

We  performed  a  variety  of  benchmarks  on  large  three-dimensional  model  problems 
over  cube-shaped  domains  discretized  with  a  seven  point  template.  The  highest  compute 
tional  rate  we  were  able  to  achieve  for  the  sparse  triangular  solve  was  13.1  MFIops  on  4K 
processors;  this  would  correspond  to  a  timing  of  210  MFIops  on  an  appropriately  scaled 
problem  on  a  64K  processor  machine.  The  highest  computational  speed  we  achieved  for 
a  matrix  vector  multiply  was  64.2  MFIops,  this  would  correspond  to  to  a  speed  of  1027.0 
MFIops  on  a  64K  processor  machine.  We  compared  the  computational  speed  obtained 
from  the  CM-2  with  that  obtained  from  equivalent  highly-optimized  code  on  a  single  head 
of  a  Cray  X/MP.  For  all  three  computational  kernels  the  projected  speeds  on  a  full  64K 
processor  machine  exceed  the  Cray  X/MP  speeds  by  a  factor  of  at  least  5  to  6.  Thus  for 
regularly-structured  problems,  the  CM-2  achieves  impressive  computational  speeds.  When 
one  takes  into  account  the  price  differential  between  these  two-systems,  one  realizes  that 
the  price-performance  advantage  for  the  CM-2  over  a  single-headed  Cray  X/MP  is  truly 
outstanding. 

The  timings  on  the  CM-2  were  highly  dependent  upon  the  use  of  our  very  specialized 
programs.  These  programs  mapped  the  problem  domain  onto  the  processor  topology  very 
carefully  and  used  the  optimized  local  NEWS  communications  network.  We  explored  the 
consequences  of  relaxing  these  restrictions  in  a  variety  of  ways.  Performance  degraded 
very  significantly  as  we  abandoned  the  optimized  NEWS  communications  network  and 
abandoned  the  careful  mapping  of  a  problem  domain  to  the  CM-2  processor  topology.  In 
some  cases  we  noted  20  to  75  fold  degradations  in  performance  as  these  constraints  were 
relaxed. 


Scientific  Computing  Associates 


2.  Overview 

There  are  at  least  two-critical  components  required  to  obtain  extremely  fast  methods 
for  solving  linear  systems.  One  is  the  use  of  efficient  and  robust  numerical  algorithms, 
and  the  other  is  the  employment  of  effective  techniques  for  delivering  a  large  amount  of 
computing  power.  These  requirements  can  conflict  with  one  another  in  a  variety  of  ways. 
Many  modern  methods  of  solving  reasonably  general  classes  of  linear  systems  involve  a 
degree  of  implicitness;  this  implicitness  can  limit  the  amount  of  available  concurrency. 

When  Krylov  space  linear  solvers  are  used,  the  choice  of  preconditioner  can  play  a 
major  role  in  determining  the  operation  count  of  the  resulting  algorithm  [7,  10,  11,  12]. 
Unfortunately,  some  of  the  most  powerful  preconditioners  are  obtained  by  using  incom¬ 
pletely  factored  matrices.  To  apply  these  preconditioners,  we  rnu.it  repeatedly  solve  sparse 
triangular  systems.  The  efficiency  with  which  such  solutions  could  be  carried  out  was 
characterized  by  Saad  and  Schuitz  [Hj.  Sparse  triangular  systems  arising  from  a  range 
of  problems  have  been  solved  efficiently  on  a  number  of  shared  memory  architectures  [ij, 
[3],  [4],  [9].  Because  data  dependencies  limit  the  concurrency  available  from  a  sparse  tri¬ 
angular  solve,  it  has  not  been  clear  that  triangular  solves  could  be  employed  usefully  in 
programs  written  for  massively  parallel  architectures.  One  goal  of  our  study  was  to  de¬ 
termine  whether  one  could  realistically  hope  to  take  advantage  of  incompletely  factored 
matrices  as  preconditioners  for  solving  sparse  linear  systems  on  massively  parallel  archi¬ 
tectures  such  as  the  CM-2. 

Several  other  researchers  have  addressed  the  relative  utility  of  various  forms  of  pre¬ 
conditioning  on  the  CM-2.  Both  [8]  and  [13]  concluded  that  it  was  not  worthwhile  to  pre¬ 
condition  using  incompletely  factored  matrices  when  solving  linear  systems  arising  from 
two-dimensional  partial  differential  equations  on  massively  parallel  machines  such  as  the 
CM-2.  We  concentrated  our  efforts  on  a  simple  three-dimensional  model  problem  because 
of  the  greater  parallelism  found  in  the  sparse  triangular  solve. 

Preconditioning  based  on  polynomials  often  has  been  suggested  as  a  alternate  precon¬ 
ditioning  that  is  highly  parallel.  While  such  preconditioners  may  achieve  high  computation 
rates,  the  overall  picture  is  not  clear  since  they  often  require  more  iterations.  One  goal 
of  our  study  was  to  consider  whether  the  faster  computation  rate  of  a  polynomial  precon¬ 
ditioner  on  a  massively  parallel  machine  such  as  the  CM-2  would  overcome  the  greater 
number  of  iterations  required  to  converge  to  a  solution. 

In  the  work  described  in  this  Phase  I  final  report,  we  have  carefully  examined  a  set 
of  model  problems  and,  under  the  appropriate  circumstances,  we  are  able  to  obtain  sur¬ 
prisingly  good  performance  on  key  computational  kernels  including  the  sparse  triangular 
solves  used  for  preconditioning  in  Krylov  space  solvers.  To  put  our  results  in  perspective, 
we  compared  our  measured  and  projected  computational  speeds  with  results  from  experi¬ 
ments  on  a  single  head  of  a  Cray  X/MP.  We  concluded  that  for  selected  problems,  a  B4K 
processor  CM-2  could  outperform  an  analogously  well-tuned  program  run  on  a  single  head 
of  a  Cray  X/MP  by  at  least  a  factor  of  5  in  performing  the  Rparse  triangular  solves  and 
the  matrix  vector  multiplies  required  in  the  Krylov  solver  inner  loop. 

Efficient  methods  for  solving  partial  differential  equations  frequently  make  use  of  non- 
uniform  grids  designed  to  put  the  most  computational  effort  where  the  problem  is  hardest. 
An  effect  of  this  approach  is  that  the  algebraic  linear  (and  non-linear)  systems  that  must 
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eventually  be  solved  are  sparse  and  quite  irregular  in  structure.  Careful  mapping  of  work¬ 
load  can  be  extremely  important  in  obtaining  adequate  performance  from  multiprocessor 
architectures  with  strong  memory  hierarchies;  mapping  is  typically  straightforward  in  very 
regular  problems  with  a  known  structure  and  is  much  more  problematic  for  problems  with 
unknown  or  irregular  structures. 

Another  goal  of  our  Phase  I  research  was  to  quantify  the  degree  to  which  performance 
on  a  machine  such  as  *be  CM-2  depends  on  exploiting  regularities  in  problem  structure.  We 
will  present  a  number  of  timings  on  the  CM-2  for  sparse  matrix  vector  multiplies,  sparse 
triangular  solves  and  inner  products,  which  show  that  obtaining  good  performance  on  the 
CM-2  is  critically  dependent  on  the  use  of  very  well-tuned  special-purpose  computational 
kernels.  We  have  described  a  number  of  experimental  results  arising  from  our  investigations 
in  [5]  and  [6], 

In  Section  3,  we  present  what  we  regard  as  best  case  timings  for  the  sparse  matrix 
vector  multiplies,  sparse  triangular  solves,  inner  products  and  SAXPYs  that  can  constitute 
the  iterative  portion  of  Krylov-based  programs.  The  timings  show  that  for  large  three- 
dimensional  problems,  good  performance  can  be  obtained  from  sparse  triangular  solves. 
In  Section  4,  we  demonstrate  that  performance  on  the  CM-2  is  an  extremely  sensitive 
function  of  1)  problem  mapping  and  2)  a  priori  knowledge  of  dependency  patterns.  We 
will  show  that  this  performance  sensitivity  shows  up  very  strongiy  not  only  in  the  sparse 
triangular  solves  but  also  in  the  sparse  matrix  vector  multiply. 


3 


Scientific  Computing  Associates 


3.  Performance  on  a  Regular  Three-Dimensional  Mesh 

In  this  section  we  describe  the  results  of  experiments  that  give  a  best  case  estimate  of 
the  rate  with  which  the  CM-2  can  carry  out  the  portions  of  Krylov  space  linear  equation 
solvers.  We  will  first  present  timings  from  consecutive  sweep?  over  a  three-dimensional 
me^h  along  with  timings  for  the  corresponding  inner  products  and  SAXPYs  I  he  perfor¬ 
mance  measurements  we  obtain  here  characterize  the  performance  that  would  arise  from 
the  iterative  portions  of  linear  solvers  employing  many  simple  preconditioners.  We  will 
then  present  timing  results  from  a  program  that  perforins  a  sequence  of  linked  matrix 
vector  multiplies  and  triangular  solves.  We  argue  that  the  matrix  vector  multiply  and 
triangular  solve  timings  obtained  from  this  benchmark  are  a  fair  measure  of  the  timings 
that  would  be  observed  from  these  procedures  were  they  integrated  into  an  iterative  loop 
of  the  appropriately  preconditioned  Krylov  solver.  As  part  of  this  test  loop,  we  also  mea¬ 
sured  the  time  required  to  perform  inner  products  in  a  manner  that  conformed  with  data 
structures  and  the  mapping  used  for  the  other  two  procedures. 

The  software  provided  with  the  CM-2  makes  use  of  the  concept  of  virtual  processors ; 
one  can  program  the  CM-2  so  that,  it  appears  that  there  are  a  larger  number  of  processors 
than  actually  exist.  The  CM-2  software  assigns  blocks  of  virtual  processors  to  each  real 
processor.  This  assignment  of  multiple  virtual  processors  to  each  real  processor  tends  to 
amortize  the  overhead  of  transmitting  each  instruction  to  the  physical  processors.  In  most 
of  the  problems  we  investigated  in  Phase  I,  increasing  the  ratio  of  virtual  to  real  processors 
reduced  overheads  due  to  communication. 

3.1.  Mesh  Sweeps 

One  can  use  a  very  large  number  of  virtual  processors  in  implementing  a  sparse  matrix 
vector  multiply.  We  examined  the  performance  of  a  very  specialized  PARIS  program 
(PARIS  is  the  CM-2  assembly  language)  which  was  written  for  a  three-dimensional  problem 
on  a  cube  with  a  seven-point  operator.  The  program  consisted  of  a  sequence  of  sweeps 
over  a  three-dimensional  mesh.  The  mesh  was  embedded  into  a  cube  of  virtual  processors 
with  one  mesh  point  assigned  to  a  virtual  processor.  The  cube  of  virtual  processors  used 
by  the  program  has  an  edge  size  equal  to  a  power  of  two.  Subject  to  this  constraint, 
the  largest  mesh  we  could  embed  was  one  with  an  edge  size  of  64.  Each  iteration  of  the 
1000  carried  out  took  an  average  of  49  milliseconds.  This  corresponds  to  a  speed  of  64.2 
MFIops  on  the  4K  processors.  With  an  appropriately  scaled  problem,  one  might  expect 
to  obtain  1027.0  MFIops  on  a  64K  processor  machine  (appropriately  scaled  means  keeping 
the  virtual  processor  ratio  fixed).  The  results  obtained  from  timings  for  meshes  of  varying 
sizes  on  4K  processors  are  shown  in  Table  1. 

In  Table  1  we  also  depict  measurements  obtained  from  SAXPYs  and  inner  products 
over  three-dimensional  domains.  All  of  these  results  were  obtained  by  timing  100,000  con¬ 
secutive  iterations.  Because  SAXPYs  do  not  require  communication,  we  expect  to  obtain 
extremely  high  performance.  For  SAXPYs  carried  out  over  a  cube  of  virtual  processors 
with  edge  size  128,  we  obtained  a  speed  on  4K  processors  of  235.0  MFIops,  which  would 
correspond  to  3760  MFIops  on  a  64K  processor  machine.  The  efficiency  with  which  SAX¬ 
PYs  are  performed  decreases  when  one  uses  fewer  virtual  processors.  A  cube  with  edge 
size  16  has  one  virtual  processor  for  each  actual  processor.  From  TVdde  ;  „e  see  that  the 
speed  of  this  computation  decreases  to  131.0  MFIops  —  note  that  this  reduction  in  speed 
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(Jrid 

Size 

edge 

MVM 

MFlops 

SAXPY 

MFlops 

Inner  Product 
MFlops 

16 

23.5 

131.0 

14.6 

32 

48.1 

226.0 

81.3 

64 

64.2 

233.9 

185.8 

128 

N.A. 

235.0 

220.9 

Table  1:  Three-DimensSonai  Embedding 
Mesh  Sweeps,  Inner  Products,  SAXPYs 
4K  processors 


must  be  unrelated  to  communication  overhead.  In  Table  1,  we  also  present  timings  for 
inner  products  carried  out  over  a  cube  of  virtual  processors  with  varying  edge  size. 

3.2.  Performance  from  Iterative  Loops  with  Triangular  Solves 

We  next  present  timing  results  from  a  program  that  performs  a  sequence  of  linked 
matrix  vector  multiplies  and  triangular  solves.  Let  M  represent  a  matrix  obtained  from 
the  uniform  discretization  of  a  cube  with  a  seven  point  template  and  let  L  represent  a 
lower  triangular  matrix  with  the  same  sparsity  structure  as  M.  We  carried  out  the  test 
calculation  depicted  in  the  program  below. 

do  100  times 

Mx  —  y 

Solve  Lx  =  y 

z  —  inner-product(y.y) 

end  do 

This  program  carried  out  the  matrix  vector  multiply  Mx  —  y  by  sweeping  over  a 
three-dimensional  mesh.  We  embedded  the  three-dimensional  mesh  into  a  two-dimensional 
gray-coded  processor  lattice.  The  sparse  lower  triangular  system  of  equations  I,z  -  y  was 
then  solved  by  sweeping  over  another  three-dimensional  mesh,  embedded  in  a  conforming 
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fashion  into  the  same  two-dimensional  processor  lattice.  The  sweep  nsed  to  solve  the 
sparse  triangular  system  was  carried  out  in  a  manner  that  respected  the  dependencies  of 
the  problem.  As  part  of  the  test  loop,  we  also  measured  the  time  required  to  perform  inner 
products  in  a  manner  that  conformed  with  data  structures  and  the  mapping  used  for  the 
other  two  procedures. 

We  now  describe  in  more  detail  how  the  triangular  solve  was  carried  out.  In  a  cube 
with  n  points  along  any  edge,  i,j,k  between  1  and  n  are  used  to  define  the  position  of 
a  point  in  the  cube  where  t ,  j,  k  represent  the  cartesian  coordinates  of  a  point  in  the  3-1) 
mesh.  We  can  parallelize  this  three-dimensional  triangular  solve  by  concurrently  solving, 
for  each  consecutive  t>,  the  plane  of  points  satisfying  the  condition  »  f  j  (  k  —  v  for 
1  <  u  <  3n  —  2.  Each  processor  in  the  lattice  contained,  for  a  given  t  and  j,  variables 
corresponding  to  values  of  k  between  1  and  n. 

Table  2  depicts  the  timings  obtained  for  various  size  domains  for  4K  processors,  tim¬ 
ings  from  100  iterations  were  averaged.  The  timings  obtained  from  the  cube  with  edge 
sizes  128  and  64  corresponded  to  188.9  and  104.9  MFlops  respectively  for  the  inner  prod¬ 
uct,  52.6  and  27.1  MFlops  respectively  for  the  matrix  vector  product  and  13.1  and  7.0 
MFlops  respectively  for  the  triangular  solve.  From  the  above  results  we  conclude  that  for 
an  appropriately  scaled  problem  on  64K  processors,  the  fastest  results  we  could  obtain  for 
the  inner  product,  matrix  vector  multiply  and  triangular  solve  would  be  3022  MFlops,  842 
MFlops  and  210  MFlops  respectively. 


0  rid 
Edge 
Size 

Inner 

Product 

(ms) 

1 

■uSH 

Triangular 

Solve 

(ms) 

Total 

Time 

(",0) 

16 

15 

29.0 

51.9 

81.3 

I  32 

56.7 

106.8 

162.4 

i  48 

78.4 

169.1 

240.6 

64 

5.0 

115.9 

225.7 

321.5 

22.2 

478.3 

9G0.1 

1511.5 

Table  2:  Matrix  Vector  Multiply,  Triangular  Solve,  Inner  Product 

Optimized 

4K  processors 
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We  were  able  to  prevent  the  need  to  move  data  when  we  followed  the  the  matrix 
vector  multiply  by  a  sparse  triangular  solve  because  of  the  way  in  which  wre  assigned  data 
to  processors.  This  method  of  data  assignment  did  have  the  side  effect  of  requiring  us  to 
perform  the  sparse  matrix  vector  multiply  in  n  consecutive  phases.  As  one  might  expert, 
the  use  of  this  embedding  can  be  seen  to  exact  a  performance  penalty  when  compared  to 
the  embedding  discussed  in  Section  3.1.  For  example  we  attained  a  computational  speed  of 
64.2  MFIops  for  the  problem  with  edge  size  64.  In  Section  3.1,  we  attain  a  computational 
speed  of  only  27.1  MFIops  with  the  embedding  discussed  in  this  section.  The  conforming 
inner  product  also  had  to  be  performed  in  n  consecutive  phases. 

3.3.  Comparison  with  Cray  Timings 

To  ut  the  CM-2  timings  presented  above  into  perspective  we  will  present  timings  from 
a  single  head  of  a  Cray  X /MP  for  triangular  solves  and  matrix  vector  multiplies  arising  from 
problems  with  the  same  structure  as  the  problem  we  presented  above.  We  used  PCCPAK*, 
a  commercially  available  Krylov  space  solver  that  handles  general  sparse  matrices.  In  this 
program,  the  computation  of  the  matrix  vector  multiply  and  triangular  solve  in  the  iterative 
loop  were  vectorized  using  Cray  Assembly  Language  (CAL).  Computational  speeds  of  24 
MFIops  and  22  MFIops  were  obtained  when  performing  matrix  vector  multiplies  that  arose 
from  meshes  with  edge  sizes  20  and  30  respectively.  Note  that  the  speed  of  the  Cray  did 
not  increase  with  increasing  problem  size;  on  a  vector  machine  such  as  the  Cray  X/MP, 
the  computational  rate  depends  chiefly  on  the  problems  structure  rather  than  its  size.  Ily 
using  a  more  specialized  data  structure  that  was  w-ell  suited  for  vectorization,  Ashcraft  and 
Grimes  [2]  report  speeds  of  150  MFIops  on  a  single  head  of  an  Cray  X/MP  for  a  matrix 
vector  product.  The  speed  of  the  matrix  vector  multiply  in  the  largest  problem  described 
in  Section  3.2  was  52.6  MFIops  for  the  4K  processor  machine.  This  speed  should  increase 
to  842  MFIops  in  a  64K  processor  machine  on  an  appropriately  scaled  problem. 

The  triangular  solve  comparisons  were  also  very  favorable.  For  meshes  with  edge  sizes 
of  20  and  30  a  single  head  of  an  X/MP  we  achieved  computational  rates  of  1 3  and  1 2  MFIops 
respectively  using  PCGPAK.  Ashcraft  and  Grimes  [2]  report  speeds  of  25-40  MFIops  on 
triangular  solves  in  problems  with  the  same  structure;  again  their  higher  computat ional 
rates  were  achieved  through  the  use  of  specialized  data  structures.  The  computational 
speed  achieved  by  the  largest  problem  described  in  Section  3.2  was  13.1  MFIops,  which 
should  increase  to  210  MFIops  in  a  full  64K  machine  for  an  appropriately  scaled  problem. 

3.4.  Summary  of  Performance  on  a  Regular  Three-Dimensional  Mesh 

The  benchmarks  described  in  this  section  imply  that  one  can  achieve  respectable 
performance  when  one  is  able  to  make  use  of  a  highly  tuned,  special  purpose  PARIS 
program  for  carrying  out  the  iterative  portion  of  a  large  three-dimensional  computation. 
Our  experiences  with  the  higher  level  languages  *lisp  and  *C  were  less  satisfactory.  For 
instance,  our  *  lisp  versions  of  the  sparse  triangular  solve  described  above  required  a  factor 
of  20  more  time  to  run  than  did  the  PARIS  version. 

3.5.  Polynomial  Preconditioning 

The  relatively  slow  s  ,ceds  of  the  general  router  on  the  connection  machine,  as  well  a.i 
the  relatively  small  parallelism  in  incomplete  factorizations  done  in  a  fully  “data-parallel" 

•PCGPAK  l«  «  registered  trndemftrk  of  Scientific  Computing  AMocinted,  Inc.,  New  Hftven,  CT 
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'lest 

Problem 

GMRES 

(K) 

Number  of 
Iterations 

SPEl 

100 

718 

SPE2 

100 

>  2000 

SPE3 

100 

>  2000 

SPE4 

100 

120 

SPE5 

100 

>  2000 

5-pt 

100 

265 

9-pt 

100 

HO 

7-pt 

100 

86 

L9-pt 

100 

403 

L7-pt 

50 

263 

Table  3:  Observed  iteration  counts  for  a  selection  of 
test  problems  using  no  preconditioning  in  the  GMRES 
method. 

way  suggest  that  one  may  do  better  by  using  simpler  preconditioned  that  vectorize  well. 
An  obvious  choice  for  such  a  preconditioners  is  a  polynomial  preconditioner  since  it  can 
be  computed  with  highly  parallel  matrix  vector  products. 

Because  the  particular  choice  of  polynomial  can  have  a  significant  effect  on  the  ef¬ 
fectiveness  of  the  preconditioner,  we  looked  instead  at  the  performance  of  the  GMRES 
method  with  no  preconditioner  but  with  a  large  number  of  “direction  vectors.”  Roughly, 
the  GMRES  method  minimizes  the  residual  over  combinations  of  A't  for  t  —  0, .  .  . ,  It  -  1 , 
and  hence  in  some  measure  uses  in  d  iterations  the  best  possible  polynomial  of  degree  d. 
The  test  problems  included  the  usual  5-  and  9-point  two-dimensional  model  problems,  a 
7-point  three-dimensional  model  problem,  and  a  set  of  “real-world”  problems  (the  SPE 
set).  By  comparing  results  using  GMRES  with  many  direction  vectors  (large  k)  with  GM¬ 
RES  with  incomplete  factorizations,  we  can  estimate  the  number  of  additional  iterations 
that  a  polynomial  preconditioning  would  take  (divide  the  number  of  iterations  of  GMRES 
without  preconditioning  by  the  degree  of  the  polynomial  to  get  an  estimate  of  the  number 
of  iterations  with  polynomial  preconditioning). 

From  our  results,  we  can  estimate  the  relative  cost  between  using  a  highly  parallel 
polynomial  preconditioning  and  a  less  parallel  but  more  effective  incomplete  factorization. 

3.6.  Brief  Description  of  the  Test  Problems 

In  this  section,  we  present  the  eight  test  problems  used  in  our  experiments. 


Problem  1  This  problem  models  the  pressure  equation  in  a  sequential  black  oil  simulation. 
(SPE1)  The  grid  is  10  x  10  x  10  with  one  unknown  per  gridpoint  for  a  total  of  1000 
unknowns. 


Problem  3  This  problem  arises  from  the  thermal  simulation  of  a  steam  injection  process. 
(SPE2)  The  grid  is  6  x  6  x  5  with  6  unknowns  per  grid  point  giving  1080  unknowns. 
The  matrix  is  a  block  seven  point  operator  with  6x6  blocks. 
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Problem  3 
(SPE3) 


Problem  4 
(SrE4) 


Problem  5 
(SPE5) 


Problem  6 
(5-Pt) 
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Test 

Problem 

GMRES 

(K) 

Preconditioning 

Number  of 

iterations 

SPE1 

1 

IIAJ(O) 

70 

SPE2 

1 

ILU(0) 

23 

SPE.3 

10 

MILU(0) 

51 

SPE4 

10 

ILU(0) 

39 

SPE5 

20 

ILU(0) 

104 

5-pt 

5 

MILU(0) 

44 

9-pt 

10 

ILU(2) 

47 

7-pt 

1 

ILU(0) 

39 

1, 9-pt 

10 

IbU(l) 

80 

L7-pt 

1 

ILU(0) 

31 

Table  4:  Observed  iteration  counts  for  a  selection  of  test, 
problems  using  incomplete  factorization  preconditioners 
in  the  GMRES  method. 


This  problem  comes  from  an  IMPES  simulation  of  a  black  oil  model.  The 
matrix  is  a  seven  point  operator  on  a  35  x 11 x  13  grid  yielding  5005  equations. 


This  problem  also  comes  from  an  IMPES  simulation  of  a  black  oil  model.  The 
matrix  is  a  seven  point  operator  on  a  16  x  23  x  3  grid  giving  1 104  equations. 


This  problem  arises  from  a  fully-implicit,  simultaneous  solution  simulation  of 
a  black  oil  model.  It  is  a  block  seven  point  operator  on  a  16  x  23  x  3  grid 
with  3x3  blocks  yielding  3312  equations. 


This  problem  is  a  five  point,  central  difference  discretization  of  the  following 
equation  on  the  unit  square: 


r“)  “  r(<IS,r") 1 2(r  +  v)(ru  1  ru) 1  (2 1  — )u  -  / 

OX  dy  dy  ,ydx  dy  '  v  1  t  x  I  y‘ 

with  Dirichlet  boundary  conditions  and  /  chosen  so  that  the  exact  solution  is 
u  —  x  e.xy  8in(7ri)  sin(7ry). 


The  discretization  grid  is  63  x  63  giving  3969  unknowns.  The  I,5-pt  problem 
is  the  same  problem  with  a  200  x  200  grid. 
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Problem  7  This  problem  is  a  nine  point  box  scheme  discretization  for  the  following  equa- 
(9-pt)  tion  on  the  unit  square: 

,  d2  d2  ,  n  d  3 

“(5?“  +  5P“)  +  23i“  '  ~  1 

with  Dirichlet  boundary  conditions  and  f  chosen  so  that  the  exact  solution  is 
u  =  z  exv Bin(nx)  Bin(ny). 

The  discretization  grid  is  63  x  63  giving  3969  equations.  The  L9-pt  problem 
is  the  same  problem  with  a  127  x  127  grid. 


Problem  8  This  problem  is  a  seven  point  central  difference  discretization  of  the  following 
(7-pt)  equation  on  the  unit  cube: 

d  .  Til  d  d  ,  _  d  d  .  d  .  „  ,  ,  d  .  1  . 

-~-(e  V“  "T  *  Va~u)~x~ («  VT" «)  +  80(x  +  y  +  «)  — a  +  (40+  — - - - - — )u  =  / 

ox  ox  ay  ay  dz  dz  ox  1  -f  x  -f-  y  4-  z 

with  Dirichlet  boundary  conditions  and  /  chosen  so  that  the  exact  solution  is 

u  =  (1  -  rr)(l  -  y)(l  -  z){  1  -  e~x)(l  -  e~y)(l  -  e~*). 

The  discretization  grid  is  20  X  20  x  20  yielding  8000  equations.  The  L7-pt 
problem  is  the  same  problem  with  a  30  x  30  x  30  grid. 


3.7.  Summary  of  Preconditioning  Results 

Table  3  shows  the  observed  iteration  counts  for  unpreconditioned  GMRES  with  are 
large  number  of  direction  vectors.  Table  4  shows  the  observed  iteration  counts  for  the 
same  problems,  using  various  flavors  of  incomplete  factorization.  In  particular,  the  “real- 
world”  SPE  problems  (except  SPE4)  take  more  than  10  times  as  many  iterations  using  no 
preconditioning  as  when  using  incomplete  factorizations.  Very  roughly,  this  means  that 
using  a  tenth-degree  polynomial  as  a  preconditioner  would  give  at  best  a  similar  iteration 
count  (note  that  many  of  the  SPE  problems  did  not  converge  without  preconditioning). 
Thus,  an  incomplete  factorization  that  took  as  much  time  as  such  10  matrix  vector  products 
would  still  be  competitive  and  perhaps  preferable  (because  of  the  better  apparent  behavior 
of  incomplete  factorizations  on  a  wide  range  of  problems). 

It  is  also  important  to  note  that,  in  contrast,  the  simple  model  problems  are  solved 
fairly  eflectb’ely  by  GMRES  with  no  preconditioning,  and  hence  polynomial  precondition¬ 
ing  may  be  effective  for  these  model  problems.  Still,  we  are  interested  in  using  a  massively 
parallel  machine  for  solving  large  and  difficult  linear  systems,  and,  as  long  as  the  incom¬ 
plete  factorization  solves  can  be  done  with  reasonable  efficiency,  they  are  a  better  choice 
of  preconditioner. 
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4.  The  Importance  of  Careful  Embedding 

In  Section  3  we  described  how  it  was  possible  to  achieve  respectable  rates  of  compu¬ 
tation  even  on  iterative  loops  that  included  a  sparse  triangular  solve.  In  this  section,  we 
present  some  benchmarks  that  yield  insight  on  what  is  required  for  achieving  this  high 
performance. 

4.1.  The  Consequences  of  a  Poor  Mapping 

We  have  found  that  20  to  75  fold  performance  differences  can  be  observed  when  we 
compared  the  performance  of  two  versions  of  a  sparse  matrix  vector  multiply  program 
(Table  5).  The  problem  consisted  of  sweeps  over  sparse  matrices  generated  by  square 
domains  of  varying  sizes  with  five  point  templates.  The  first  version  is  explicitly  mapped 
onto  the  machine  in  a  way  that  allows  us  to  utilize  the  CM-2's  fast  NEWS  network  for  local 
communication.  The  other  version  uses  a  general  router  designed  to  carry  out  arbitrary 
patterns  of  interprocessor  communication.  Both  versions  were  programmed  in  *I,isp  and 
used  the  same  data  structures  to  represent  the  sparse  matrices.  For  the  256  by  256  problem 
the  timings  for  the  explicitly  mapped  NEWS  network  code  corresponded  to  a  speed  of 
40.0  MFlops  —  the  timings  for  the  router  version  of  the  code  achieved  a  speed  of  0.5 
MFlops.  Note  that  the  cost  of  computation  does  not  vary  significantly  with  the  mapping 
and  choice  of  communications  method;  this  consistency  provides  a  reassurance  that  the 
timing  differences  noted  are  actually  due  to  communication  related  costs.  The  timings 
depicted  here  are  averages  obtained  from  1000  iterations  of  the  matrix  vector  multiply 
code.  From  this  benchmark,  it  is  quite  clear  that  satisfactory  performance  cannot  be 
obtained  even  from  the  most  rudimentary  sparse  matrix  code  if  one  were  to  map  sparse 
matrices  onto  the  CM-2  without  regard  to  data  dependency  patterns. 

4.2.  The  Consequences  of  a  Using  the  Genera]  Router  in  a  Well-Mapped  Froblem 

We  next  attempt  to  obtain  some  rough  estimates  of  the  performance  we  could  expect 
if  we  were  to  write  an  iterative  loop  of  a  Krylov  linear  solver  in  a  program  capable  of 
mapping  reasonably  general  sparse  matrices  onto  the  CM-2.  We  must  assume  that  the 
sparse  matrix  could  arise  from  a  mesh  with  irregularities.  We  consequently  must  use  the 
general  router  rather  than  the  NEWS  network. 

We  will  present  benchmarks  that  attempt  to  quantify  the  effects  of  using  the  general 
router  on  a  well-mapped  problem.  This  should  give  us  a  best  case  estimate  of  the  per¬ 
formance  we  might  expect  from  using  the  general  router.  We  examined  the  performance 
of  versions  of  the  PARIS  program  described  in  Section  3.2  in  which  we  performed  data 
fetches  or  data  sends  using  the  general  router  instead  of  performing  data  sends  using  the 
NEWS  network.  Note  that  in  all  of  these  cases,  the  problem  was  mapped  so  that  only 
nearest  neighbor  communications  were  needed.  The  timings  we  obtained  with  data  fetches 
carried  out  using  the  general  router  can  be  interpreted  as  best  case  estimates  of  what  one 
could  expect  from  a  CM-2  executor  that  did  not  have  access  to  a-priori  information  on 
dependency  patterns.  Table  6  depicts  the  average  time  per  iteration  required  to  solve  a 
problem  over  a  64  by  64  by  64  grid.  The  time  was  averaged  over  100  iterations.  Note  that 
in  this  case  there  is  a  roughly  eight-fold  performance  difference  between  the  optimized 
NEWS  network  program  and  the  version  employing  the  general  router  fetch  instruction. 
As  we  mentioned  in  Section  3,  on  the  4K  processor  machine,  the  program  employing  the 
NEWS  network  achieves  a  speed  of  27.14  MFlops  on  the  matrix  vector  multiply  and  a 
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Grid 

Size 

News 

Total 

(ms) 

News 

Comp 

(ms) 

General 

Total 

(ms) 

General 

Comp. 

(ms) 

64  x  64 

2.25 

0.93 

45.56 

0.99 

128  x  128 

4.62 

1.88 

189.44 

1.89 

256  x  256 

13.11 

5.46 

1001.11 

5.51 

Tabic  5:  Matrix  Vector  Multiply 

Explicitly  mapped  News  Net  vs.  General  Router 

4K  processors 


speed  of  6.97  MFlops  on  the  triangular  solve.  The  fetch  version  of  the  general  router  pro¬ 
gram  achieves  a  speed  of  3.58  MFlops  on  the  matrix  vector  multiply  and  1.14  MFlops  on 
the  triangular  solve. 

In  Table  7  we  compare  timings  obtained  through  the  use  of  the  NEWS  network  and 
the  fetch  instruction  from  the  general  router  for  three-dimensional  meshes  with  varying 
edge  size.  We  also  present  the  ratio  of  the  fetch  timings  to  the  NEWS  net  timings.  For 
both  the  matrix  vector  multiply  and  the  triangular  solve,  the  ratio  between  the  NEWS  net 
and  router  timings  remain  roughly  constant  for  all  meshes  with  edge  size  up  to  64.  The 
router  timings  become  relatively  less  efficient  for  the  mesh  with  edge  size  128.  When  we 
solVe  this  problem  with  a  128  edge  sized  mesh,  4  virtual  processors  are  assigned  to  each 
actual  processor.  As  described  above,  the  assignment  of  multiple  virtual  processors  to  each 
actual  processor  can  have  the  effect  of  reducing  communications  overhead;  less  information 
needs  to  be  exchanged  between  processors.  When  the  router  is  used,  each  virtual  processor 
must  go  through  the  overhead  of  using  the  router,  even  when  only  sending  information  to 
another  virtual  processor  assigned  to  the  same  actual  processor. 
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Function 

NEWS 

(ms) 

Router 

Send 

(ms) 

Router 

Fetch 

(ms) 

Matrix  Vector 

116 

249 

876 

Solve 

226 

374 

1376 

Total 

342 

606 

2247 

Table  6:  Three-Dimensional  Mesh  Sweep  and  Solve 
News  Net  vs.  Send  and  Fetch  General  Router 
4K  processors 


5.  Conclusions 

In  Section  3,  we  presented  what  we  might  regard  as  best  case  timings  for  the  sparse 
matrix  vector  multiplies,  sparse  triangular  solves,  and  inner  products  that  constitute  the 
iterative  portion  of  Krylov  space  linear  solvers  when  the  solvers  use  incompletely  factored 
matrices  for  preconditioning.  We  performed  timings  for  large  three-dimensional  model 
problems  over  cube-shaped  domains  discretized  with  a  seven  point  template.  The  highest 
computational  rate  we  were  able  to  achieve  for  the  sparse  triangular  solve  was  13.1  MFIops 
on  4K  processors,  which  would  correspond  to  a  timing  of  210  MFIops  on  an  appropriately 
scaled  problem  on  a  64 K  processor  machine.  The  highest  computational  speed  we  achieved 
for  a  matrix  vector  multiply  was  61.2  MFIops,  which  would  correspond  to  to  a  speed  of 
1027.0  MFIops  in  a  64K  processor  machine.  Thus  for  regularly  structured  problems,  the 
CM-2  achieves  impressive  computational  speeds.  For  all  three  compute-intensive  kernels 
the  projected  speeds  on  a  full  64K  processor  machine  exceed  the  speeds  that  can  be  achieved 
by  carefully  vectorized  code  on  a  single  head  of  a  Cray  X/MP  by  a  factor  of  at  least  5  to 

6.  T!  <*se  timings  indicate  that  we  can  build  a  preconditioned  Krylov  space  solver  which 
is  both  close  to  optimal  with  respect  to  the  required  number  of  floating  point  operations 
and  yet  performs  at  a  rate  of  computation  far  exceeding  that  of  a  highly  optimized  version 
running  on  a  single-headed  Cray-X/MP. 

The  timings  on  the  CM-2  were  highly  dependent  on  the  use  of  our  specialized  pro¬ 
grams.  These  programs  mapped  the  problem  domain  onto  the  processor  topology  very 


13 


Scientific  Computing  Associates 


Edge 

Size 

MVM 

NEWS 

(ms) 

MVM 

Router 

(ms) 

MVM 

Ratio 

Solve 

NEWS 

(ms) 

Solve 

Router 

(ms) 

Solve 

Ratio 

lf» 

29 

208 

7.2 

52 

57 

441 

7.7 

107 

116 

876 

7.6 

220 

1 

6.1 

478 

4581 

9.6 

961 

7.3 

Table  7:  Th  rce-Dimensional  Mesh  Sweep  and  Solve 

News  Net  vs.  General  Router — Varying  Mesh  Si/e  4K  processors 


carefully  and  used  the  optimized  local  NEWS  communications  network.  We  explored  the 
consequences  of  relaxing  these  restrictions  in  a  variety  of  ways.  Performance  degraded 
very  significantly  as  one  abandoned  the  optimized  NEWS  communications  network  and 
abandoned  the  careful  mapping  of  a  problem  domain  to  the  CM-2  processor  topology.  In 
some  cases  we  noted  20  to  75  fold  degradations  in  performance  as  these  constraints  were 
relaxed. 

The  results  of  our  benchmarks  clearly  demonstrate  that  one  can  obtain  extremely 
high  performance  on  Krylov  space  linear  solvers  on  the  CM-2.  It  appears  that  it  would  be 
quite  feasible  to  construct  a  preconditioned  Krylov  solver  capable  of  solving  systems  arising 
from  partial  differential  equations  discretized  using  one  of  a  fixed  set  of  finite  difference 
templates  and  obtain  extraordinary  performance  on  the  CM-2.  Taking  into  account  the 
relative  performance  of  such  a  solver  on  the  CM-2  and  the  Cray-X/MP  and  the  relative 
prices  of  these  computer  systems,  an  appropriate  preconditioned  Krylov  solver  on  the  CM- 
2  would  revolutionize  the  computational  solution  of  engineering  and  scientific  problems 
involving  three-dimensional  simulations. 
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